SAA-UNet: Spatial Attention and Attention Gate UNet for COVID-19 Pneumonia Segmentation from Computed Tomography

The COVID-19 pandemic has claimed numerous lives and wreaked havoc on the entire world due to its transmissible nature. One of the complications of COVID-19 is pneumonia. Different radiography methods, particularly computed tomography (CT), have shown outstanding performance in effectively diagnosing pneumonia. In this paper, we propose a spatial attention and attention gate UNet model (SAA-UNet) inspired by spatial attention UNet (SA-UNet) and attention UNet (Att-UNet) to deal with the problem of infection segmentation in the lungs. The proposed method was applied to the MedSeg, Radiopaedia 9P, combination of MedSeg and Radiopaedia 9P, and Zenodo 20P datasets. The proposed method showed good infection segmentation results (two classes: infection and background) with an average Dice similarity coefficient of 0.85, 0.94, 0.91, and 0.93 and a mean intersection over union (IOU) of 0.78, 0.90, 0.86, and 0.87, respectively, on the four datasets mentioned above. Moreover, it also performed well in multi-class segmentation with average Dice similarity coefficients of 0.693, 0.89, 0.87, and 0.93 and IOU scores of 0.68, 0.87, 0.78, and 0.89 on the four datasets, respectively. Classification accuracies of more than 97% were achieved for all four datasets. The F1-scores for the MedSeg, Radiopaedia 9P, combination of MedSeg and Radiopaedia 9P, and Zenodo 20P datasets were 0.865, 0.943, 0.917, and 0.926, respectively, for the binary classification. For multi-class classification, accuracies of more than 96% were achieved on all four datasets. The experimental results showed that the proposed framework can effectively and efficiently segment COVID-19 infection on CT images with different contrast and that this can aid in diagnosing and treating pneumonia caused by COVID-19.


Introduction
In December 2019, people began rushing to Wuhan hospitals with severe pneumonia of unknown cause. After the number of infected people increased, on 31 December, China notified the World Health Organization (WHO) of the outbreak [1,2]. After several examinations, on 7 January, the virus was found to be a coronavirus with more than 70% similarity to SARS-CoV [3]. The disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was named COVID-19 by the WHO in February 2020 [4]. The virus belongs to the betacoronavirus family, which is highly contagious and causes various diseases. One of these viruses appeared in 2003, called severe acute respiratory syndrome (SARS), and another appeared in 2012, the Middle East respiratory syndrome (MERS) [5,6]. The first fatal case of coronavirus was reported on 11 January 2020. As a result, the WHO declared a global emergency on 30 January 2020. The number of cases began to increase dramatically due to human-to-human transmission [7]. The infection is transmitted through droplets from the coughing and sneezing of patients, whether they show symptoms or not [8]. These infected droplets can spread from one to two meters.
Artificial intelligence, specifically deep learning, has recently played an effective and influential role in medical imaging. The diagnostic evaluation of medical image data is a human-based technique that requires considerable time from expert radiologists. Recent advances in artificial intelligence have substituted many personalized diagnostic procedures with computer-aided diagnostic (CAD) methods that can achieve effective real-time diagnoses. As a result, artificial intelligence has an essential role in diagnosing infections, cancer, and many other diseases from images of an organ or even the whole body, helping radiologists make decisions and plan the stages of treatment.
The segmentation task identifies the pixels or voxels that make up the contour or the interior of the region of interest (ROI) as the first stage in computer-aided diagnostics (CAD) [17,18]. Many deep learning algorithms used in image segmentation tasks have succeeded on biomedical images. For example, the fully convolutional network (FCN) was proposed as an end-to-end, pixel-to-pixel network for image segmentation [19], as was SegNet [20]. UNet was proposed for biomedical image segmentation, in which an encoder-decoder structure with concatenated skip connections yielded significant performance improvements [21], and the modified UNet (UNet++ [22]) and PSPNet [23] have been widely used in medical image segmentation.
This research proposes a spatial attention and attention gate UNet model (SAA-UNet). Additionally, we trained the SAA-UNet model with boundary loss combined with weighted category cross-entropy and Dice loss as a loss function. The framework was used to identify areas of COVID-19 pneumonia and segment regions of interest (ROIs) from computed tomography images. We applied it to four limited datasets published in open sources at the European Institute for Biomedical Imaging Research (EIBIR) [24]. The contributions of this work are summarized as follows:
• We propose the spatial attention and attention gate UNet model (SAA-UNet) based on attention UNet (Att-UNet) and spatial attention UNet (SA-UNet). We took the attention approach proposed by Ozan Oktay et al. [25] to focus on COVID-19 infection regions. Gating based on the local feature vector of the infection improved the performance compared to gating established on a global feature vector. We took the spatial attention module (SAM) approach proposed by Changlu Guo et al. [26] to deal with features fed to the bridge of SAA-UNet from the encoder to the decoder. This lets the model retain the essential spatial features and helps reduce the number of parameters.
• SAA-UNet proved to be effective in segmenting the infection areas in CT images of COVID-19 patients.
• SAA-UNet showed good generalization when applied to different datasets.
The paper is organized as follows: Section 2 provides the related literature review. Section 3 describes the proposed framework, spatial attention, and attention gate UNet (SAA-UNet) model architecture in detail. Section 4 describes the COVID-19 CT image datasets, and Section 5 explains the analysis and preprocessing of the data. Section 6 shows the experimental results, and Section 7 provides the discussion on the experimental results. Finally, Section 8 concludes the paper and provides future work recommendations.

Related Work
With artificial intelligence (AI) advancements in the health field, many deep learning algorithms have been proposed for medical image processing, as segmentation tasks play an essential role in the treatment stage. For example, Ronneberger et al. [21] introduced the standard UNet for biomedical image segmentation. They evaluated UNet on several datasets, including the ISBI Challenge for segmenting neuronal structures in electron microscopic stacks. They achieved an average IOU on the PhC-U373 dataset of 0.92 and on DIC-HeLa of 0.777. Oktay et al. [25] proposed an extension to the UNet architecture. They added an attention mechanism to the skip connections of UNet to focus on the image's region of interest and improve the segmentation. They evaluated attention UNet on a dataset of 150 abdominal 3D CT scans from patients diagnosed with gastric cancer and achieved a Dice score of 0.84. On a second dataset consisting of 82 contrast-enhanced 3D CT scans of the pancreas, it achieved a Dice score of 0.831. In continuation, Guo et al. [26] proposed a modification of the UNet architecture that included a spatial attention module in the bridge to focus on the important regions of the image. They evaluated SA-UNet on the Digital Retinal Images for Vessel Extraction (DRIVE) dataset and the Child Heart and Health Study (CHASE-DB1) dataset. They achieved F1-scores of 0.826 and 0.815, respectively. Building on the above, deep learning models can be used to find areas of lung damage caused by 2019-nCoV. Athanasios Voulodimos et al. [27] used an FCN-8s to segment COVID-19 pneumonia and achieved a 0.57 Dice coefficient. They also proposed a light UNet model with three encoder and decoder stages to deal with the limited datasets of this problem. This achieved a 0.64 Dice coefficient. Sanika Walvekar and Swati Shinde proposed UNet with preprocessing and spatial, color, and noise data augmentation from the MIScnn library with Tversky loss [28]. 
The Dice similarity coefficient (DSC) for COVID-19 was 0.87 for infection segmentation and 0.89 for the lungs. Imran Ahmed et al. [29] proposed an attention mechanism added to the standard UNet architecture to improve feature representation with binary cross-entropy Dice loss and boundary loss. The Dice score was 0.764 on the validation set. Tongxue Zhou et al. [30] proposed a spatial attention module and a channel attention module added to a UNet architecture with focal Tversky loss. The spatial attention module reweights the feature representation spatially and channelwise to capture rich contextual relationships for better feature representation. The DSC was 0.831. Narges Saeedizadeh et al. [31] proposed a ground-glass recognition system called TV-Unet, a UNet model with a total variation gradient. The loss function was the binary cross-entropy with a total variation term. The DSC achieved 0.86 and 0.76 for two different splits. The combination of two UNet models proposed by Narinder Singh Punna and Sonali Agarwala [31] is called the CHS-NET model. One segments the lungs, and the other segments infection with the weighted binary cross-entropy and Dice loss function. The CHS-NET model uses UNet, Google's Inception model, a residual network, and an attention strategy. The DSC for the lungs was 0.96, whereas for COVID-19 infection, it was 0.81. Tal Ben-Haim et al. [32] proposed a VGG backbone in the encoder of two UNets. The first UNet model segments the lung regions from CT images. The second UNet model extracts the infection or shapes of lesions (GGO and consolidation). For the segmentation of infection with the binary cross-entropy loss, the DSC was 0.80, and for the multi-class weighted cross-entropy (WCE) and Dice loss, the GGO was 0.79 DSC and the consolidation 0.68. A plug-and-play attention module [33] was proposed to extract spatial features by adding to the UNet output. 
The plug-and-play attention module contains a position offset to build the positional relationship between pixels. This framework achieved 0.839 for the DSC. Ziyang Wang and Irina Voiculescu [34] proposed the quadruple augmented pyramid network (QAP-Net) for multi-class segmentation by establishing four augmented pyramid networks on the encoder-decoder network. These four were two pyramid atrous networks with different dilation rates, the pyramid average pooling network, and the pyramid max pooling network. The mean intersection over union (IOU) score with categorical focal loss was 0.816. Qi Yang et al. [35] used MultiResUNet [36] as the basic model, introduced a new "Residual block" structure in the encoder part, added regularization and dropout, and changed some of the activation functions from the rectified linear unit (ReLU) to LeakyReLU. The DSC with a combination of binary cross-entropy, focal, and Tversky loss was 0.884. Nastaran Enshaei et al. [37] proposed using the Inception-V3, Xception, InceptionResNet-V2, and DenseNet-121 pre-trained encoders and replacing the fully connected part of each model with a decoder to segment COVID-19 infection. The results of the multiple models were then aggregated by soft voting for each image pixel. This achieved Dice scores of 0.627 for GGO and 0.592 for consolidation with the categorical cross-entropy. Moreover, Murat Ucar [38] proposed aggregating the pre-trained VGG16, ResNet101, DenseNet121, InceptionV3, and EfficientNetB5 with a pixel-level majority vote to obtain the final class probabilities for each pixel in the image. The Dice coefficient was 0.85 with the Dice loss. Hong-Yang Pei et al. [39] proposed a multi-point supervised network (MPS-Net) based on UNet. The proposed model gave a DSC of 0.833 with a combination of binary cross-entropy and Tversky loss to detect COVID-19 infection. Ümit Budak et al. [40] proposed an A-SegNet network that combines SegNet with the attention gate (AG) mechanism. 
The DSC score was 0.896 on the validation set with focal Tversky loss.
Alex Noel Joseph Raj et al. proposed an attention gate-dense network-improved dilation convolution UNet (ADID-UNET) based on UNet [41]. ADID-UNET achieved an average Dice score of 0.803 on the MedSeg + Radiopaedia dataset with the Dice loss. Ying Chen et al. proposed the HADCNet model based on UNet, which contains hybrid attention modules in five stages of the encoder and decoder [42]. These modules help balance the semantic differences between various levels of features, which refines the feature information. HADCNet was trained with five-fold cross-validation with the cross-entropy and Dice loss on the MedSeg, Radiopaedia 9P, 150 COVID-19 patients, and Zenodo datasets, achieving Dice scores of 0.792, 0.796, 0.785, and 0.723, respectively. Nour Eldeen M. Khalifa et al. proposed an architecture of three encoder and decoder stages to deal with the limited dataset problem [43]. The mean IOU score on Zenodo 20P was 0.799. Yu Qiu et al. proposed a MiniSeg model with 83K parameters to extract multiscale features and deal with limited datasets [44]. After MiniSeg was trained with five-fold cross-validation with the cross-entropy loss on MedSeg, Radiopaedia 9P, Zenodo 20P, and MosMedData, the average Dice scores were 0.759, 0.80, 0.763, and 0.64, respectively. Xiaoxin Wu et al. proposed a focal attention module (FAM) inspired by a residual attention network that contains channel and spatial attention, with a residual branch in the feature map [45]. The focal attention module was applied to the FCN, UNet, SegNet, PSPNet, UNet++, and DeepLabV3+ with binary cross-entropy (BCE) loss, where the best was DeepLabV3+ when applied to Zenodo 20P with an average Dice score of 0.885. Feng Xie et al. proposed the double-U-shaped dilated attention network (DUDA-Net) to enhance segmentation [46]. DUDA-Net contains a coarse-to-fine network with a coarse network for lung segmentation and a fine network for infection segmentation. 
The proposed model was trained with five-fold cross-validation with Tversky loss on infection slices of Radiopaedia 9P with an average Dice score of 0.871 and a mean IOU of 0.771. Vivek Kumar Singh et al. proposed a LungInfseg model based on an encoder and decoder structure [47]. LungInfseg was applied on Zenodo 20P with a combination of blockwise (BWL) and total loss (TL), with an average Dice score of 0.8034. R. Karthik et al. proposed a contour-enhanced attention decoder CNN model with an encoder and decoder structure [48]. The proposed model with the mean pixelwise cross-entropy loss was applied to the Zenodo 20P dataset and had an average Dice score of 0.88; on the MosMedData dataset, the Dice score was 0.837, and on the combination of the Zenodo 20P and MosMedData datasets, the Dice score was 0.854. Kumar T. Rajamani et al. proposed the deformable attention net (DDANet) model [49] based on UNet and criss-cross attention (CCNet) [50]. The proposed model has the same structure as attention UNet [25], with a criss-cross attention module inserted in the bottleneck to capture non-local interactions. DDANet was trained with five-fold cross-validation on the combined dataset of MedSeg and Radiopaedia 9P with multiple classes with class-weighted cross-entropy loss where GGO was 0.734, consolidation was 0.614, and the average Dice score was 0.781.
Three-dimensional algorithms can be used for the overall CT volume of a patient. Keno K. Bressem [51] proposed adding a pre-trained 3D ResNet block to the 3D UNet architecture for COVID-19 computed tomography image segmentation. The DSC was 0.648, combining the Dice loss and pixelwise cross-entropy loss. Aswathy A. L. and Vinod Chandra [52] proposed a cascade of two 3D UNets, the first to segment the lung volumes and the second to segment the infection volume. The DSC was 0.925 for the lungs and 0.82 for the infection. The 3D algorithms for the segmentation of COVID-19 from CT are rarely used for several reasons, including the computational cost and the limited datasets available for this problem.
This research proposes a framework to train SAA-UNet in binary and multi-class segmentation using a contrast enhancement method in preprocessing and a combination of the weighted category cross-entropy, Dice, and boundary loss as the loss function. Used together with the regional loss, the boundary loss extracts useful information from the boundaries of infections with irregular and complex shapes.

Methodology
The spatial attention and attention gate UNet model (SAA-UNet) is a proposed state-of-the-art algorithm based on spatial attention UNet (SA-UNet) [26] and attention UNet (Att-UNet) [25] to deal with the complexity of COVID-19 pneumonia images. Moreover, a framework is proposed that uses a preprocessing method and a combination of weighted category cross-entropy, Dice loss, and boundary loss.
The flowchart of the framework used to train the proposed model in this research is illustrated in Figure 2. At the beginning, the slices x_i are extracted from the CT scan x_I if the dataset does not already consist of slice images. Afterward, the slices x_i are fed to the preprocessing phase, and then the pixel labels are prepared for either binary or multi-class segmentation. Then, the dataset is split into training and testing sets, and the training set is fed as the input to SAA-UNet to train with 10-fold cross-validation. Next, the trained model is tested on the test set. Finally, the masks of the region of interest (ROI) of COVID-19 damage in the lungs are predicted.
The section is organized as follows: Section 3.1 covers the CT preprocessing stage. Section 3.2 describes the SAA-UNet model architecture, with the details of the spatial attention module (SAM) in Section 3.3 and the attention gate (AG) in Section 3.4. After that, Section 3.5 explains the optimization with the combination of the weighted category cross-entropy, Dice, and boundary loss. Finally, Section 3.6 presents the performance metrics used to evaluate the SAA-UNet model.

Pre-Processing of Images
The Hounsfield unit (HU) scale is a dimensionless unit utilized in CT images, with values depending on the organ and the disease. The chest CT pixel value intensity of air is −1000, of water is 0, of the lung is −700 to −500, and of the lung tissue is 500 HU to 910 HU, whereas the chest wall, blood, and bone are higher than 500 HU. The HU histogram is used to assess and correct the limited contrast of the CT scan datasets before they are fed to the model. The CT scan contrast differs from one dataset to another. As shown in Figure 2, the preprocessing stage begins by editing the Hounsfield unit (HU) histogram if the intensity of the air pixels of x_i is less than −1000. Such values mean that the dataset has insufficient contrast, so restricting the intensities to values above −1000 allows the contrast to increase. Contrast stretching maps the useful x_i pixels at the left edge of the histogram to black and those at the right edge to white. As a result, the useless pixels are removed by creating a threshold with two cutoff points (1).
The thresholded values are obtained by building a Boolean mask on the NumPy array and selecting the values between the lower and upper bounds [53]. After enhancing the contrast, x_i is normalized and confined between 0 and 1. Then, x_i is rotated by 90 degrees together with its related masks. The final step in the preprocessing is resizing the different resolutions of x_i with the OpenCV library [54] to decrease the computational cost, using inter-area interpolation, which resamples using the pixel area relation.
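The contrast stretching, normalization, and rotation steps can be sketched as follows. This is a minimal, dependency-free illustration on nested lists rather than the NumPy and OpenCV pipeline used in this work. The function names, the example cutoff values (−1000 and 500), and the counter-clockwise rotation direction are assumptions made for illustration only; the actual cutoff points are those of Equation (1).

```python
def stretch_contrast(slice_hu, lower, upper):
    """Clip HU values to [lower, upper] and rescale them linearly to [0, 1].

    Values at or below `lower` map to 0 (black) and values at or above
    `upper` map to 1 (white). The cutoffs are hypothetical parameters.
    """
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    span = float(upper - lower)
    return [[(min(max(v, lower), upper) - lower) / span for v in row]
            for row in slice_hu]


def rotate90(image):
    """Rotate a 2D nested list by 90 degrees counter-clockwise (assumed direction)."""
    return [list(row) for row in zip(*image)][::-1]


# Toy 2 x 2 "slice" in HU; the same rotation would be applied to its mask.
toy_slice = [[-1500, -1000], [-250, 800]]
stretched = stretch_contrast(toy_slice, lower=-1000, upper=500)
rotated = rotate90(stretched)
```

Clipping before rescaling is what makes the result independent of outliers such as air values far below −1000.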

Spatial Attention and Attention UNet Model
SAA-UNet has an encoder-decoder structure, as shown in Figure 3. The encoder phase has four stages, E1, E2, E3, and E4, which help extract the information from the input CT slice images. At the beginning, with binary segmentation, x_i is fed as the input to E1, consisting of two convolutional layers with a 3 × 3 kernel size, stride 1, and 64 filters, each followed by the ReLU activation function (2), then a 2 × 2 Max-Pooling layer to progressively decrease the spatial size of the representation.
$$\mathrm{ReLU}(x) = \max(0, x) \qquad (2)$$
E2, E3, and E4 consist of two 3 × 3 convolutional layers with 128, 256, and 512 filters, respectively, and stride 1. Each convolutional layer is followed by batch normalization, ReLU activation functions, and 2 × 2 Max-Pooling. Each output of Max-Pooling is fed to the next encoder stage. Consequently, the E4 output is fed into the bridge that contains the spatial attention module (SAM). The SAM helps extract the spatial features from all encoder stages and decreases the number of parameters. After that, the output of the SAM, F_SAM ∈ R^{H×W×1}, is fed to the decoder. Moreover, the extracted feature maps of each encoder stage are transferred by a skip connection to the corresponding decoder stage, as in the UNet model. The skip connection contains an attention gate (AG) to focus on essential features. The decoder includes the D1, D2, D3, and D4 stages to determine the spatial information. Each stage has an upsample layer followed by a convolutional layer, batch normalization, and ReLU. The output of ReLU is forwarded to the attention gate (AG) as the second input. The output of the AG, F_AG ∈ R^{H×W×1}, is concatenated with the second input of the AG and goes to two convolutional layers and two ReLUs. The AGs filter the neuron activations to concentrate on a subset of target structures through the forward and backward passes. Through the backward pass, the gradients originating from background regions are down-weighted. This allows updating the model parameters in shallower layers based on relevant spatial regions. The AG parameters can be trained with the standard back-propagation updates. The convolutional layers of D1 have 512 filters, whereas those of D2 have 256, D3 have 128, and D4 have 64 filters, the same as E1. D4's last layer is a 1 × 1 convolutional layer with a Sigmoid function for predicting binary masks.
$$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (3)$$
where x is the input. In contrast, for multi-class segmentation, D4's last layer is a 1 × 1 convolutional layer with the SoftMax function:
$$\mathrm{SoftMax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{k} e^{x_j}} \qquad (4)$$
where x is the input vector, x_i and x_j are its elements, and k is the number of classes. The multiple classes are learned with multi-dimensional attention coefficients, an approach that has also been used to learn sentence embeddings [25]. Algorithm 1 gives the pseudocode of the SAA-UNet algorithm.
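The activation functions used in the encoder, decoder, and output layers can be written in a few lines. The sketch below uses only the Python standard library and is meant to illustrate the definitions, not to reproduce the TensorFlow layers used for training. The numerically stable forms (branching on the sign for the Sigmoid and subtracting the maximum for SoftMax) are a common implementation choice assumed here.

```python
import math


def relu(x):
    """ReLU: pass positive values through, set negative values to zero."""
    return max(0.0, x)


def sigmoid(x):
    """Sigmoid, written to avoid overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softmax(logits):
    """SoftMax over a list of k class scores for one pixel."""
    m = max(logits)  # subtracting the maximum does not change the result
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Per-pixel class scores for a hypothetical four-class output.
probabilities = softmax([2.0, 1.0, 0.5, -1.0])
predicted_class = probabilities.index(max(probabilities))
```

For binary masks, a pixel is typically assigned to the infection class when its Sigmoid output exceeds a threshold such as 0.5; the threshold is an assumption, since it is not stated here.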

Spatial Attention Module
The spatial attention module is the informative part that focuses on producing a spatial attention map through the spatial association between features [26,55]. As illustrated in Figure 3, the output of the last E4 layer is fed as the input to the SAM. As shown in Figure 4, the input feature of the SAM, F ∈ R^{H×W×1}, is forwarded through channelwise Max-Pooling and Average-Pooling to generate the outputs F^s_Max ∈ R^{H×W×1} and F^s_Avg ∈ R^{H×W×1}, respectively. These output feature maps are concatenated to make a feature descriptor. This is followed by a convolutional layer with a 7 × 7 kernel size and the Sigmoid activation function. After that, the output of the Sigmoid layer is multiplied elementwise with the E4 output to generate the spatial attention map F_SAM ∈ R^{H×W×1}:
$$F_{SAM} = F \otimes \sigma\left(f^{7\times 7}\left(\left[F^{s}_{Max};\, F^{s}_{Avg}\right]\right)\right)$$
where f^{7×7} denotes a convolution operation with a kernel size of 7, [· ; ·] denotes concatenation, ⊗ denotes elementwise multiplication, and σ represents the Sigmoid function.
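The data flow of the spatial attention module can be sketched in plain Python. The example assumes, for illustration, a feature map stored as an H × W × C nested list, a zero-padded convolution, and a hypothetical kernel of shape k × k × 2 whose two weights per position apply to the max-pooled and average-pooled maps. In the trained model, the kernel weights are learned and k = 7.

```python
import math


def _sigmoid(x):
    """Numerically stable Sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def spatial_attention(feature, kernel, bias=0.0):
    """Apply a spatial attention map to an H x W x C feature (nested lists).

    kernel[di][dj] holds two weights: one for the channelwise max map and
    one for the channelwise average map (hypothetical values in this sketch).
    """
    h, w = len(feature), len(feature[0])
    k = len(kernel)
    pad = k // 2
    # Channelwise pooling: one (max, average) pair per pixel.
    pooled = [[(max(px), sum(px) / len(px)) for px in row] for row in feature]
    out = []
    for i in range(h):
        out_row = []
        for j in range(w):
            acc = bias
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:  # zero padding outside
                        mx, av = pooled[y][x]
                        acc += kernel[di][dj][0] * mx + kernel[di][dj][1] * av
            gate = _sigmoid(acc)  # attention weight for this pixel
            out_row.append([gate * c for c in feature[i][j]])
        out.append(out_row)
    return out


# With an all-zero 7 x 7 kernel every gate is 0.5, so the feature is halved.
zero_kernel = [[[0.0, 0.0] for _ in range(7)] for _ in range(7)]
halved = spatial_attention([[[2.0, 4.0], [1.0, 3.0]]], zero_kernel)
```

Because the gate is a single scalar per pixel, the module adds only the weights of one small convolution, which is consistent with the low parameter count noted above.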

Attention Gating Module
The attention gate with additive attention focuses on capturing a sufficiently large receptive field and identifies feature responses to keep only the relevant ones in the region of interest [25]. In this way, it progressively suppresses feature responses in irrelevant background regions without the necessity of cropping a region of interest (ROI). The AG is applied to the features that are passed through the skip connection from the encoder stage, as shown in Figure 3, to disambiguate irrelevant and noisy responses. The two inputs to the AG are the feature map of the corresponding encoder stage and the feature map of the decoder stage, which decides on the focus regions. As shown in Figure 5, each of the two inputs is fed to a 1 × 1 convolutional layer and a batch normalization layer, and then the two outputs are added elementwise. After that, the sum is fed to the ReLU activation, a 1 × 1 convolutional layer, a batch normalization layer, and the Sigmoid activation function. The output of the Sigmoid function is multiplied elementwise with the output of the last layer of the encoder stage. Denoting the encoder and decoder inputs of the AG by F_E and F_D, this can be written as
$$F_{AG} = F_E \otimes \sigma\left(B\left(f^{1\times 1}\left(\mathrm{ReLU}\left(B\left(f^{1\times 1}(F_E)\right) + B\left(f^{1\times 1}(F_D)\right)\right)\right)\right)\right)$$
where f^{1×1} denotes a convolution operation with a kernel size of 1, B is batch normalization, and σ represents the Sigmoid function.
Algorithm 1 ends by obtaining the final feature map with Equation (3) in binary segmentation and with Equation (4) in multi-class segmentation.
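The additive attention gate can be illustrated with a short standard-library sketch. It assumes that both inputs already have the same spatial size, treats each 1 × 1 convolution as a per-pixel linear map with hypothetical weight matrices, and omits batch normalization for brevity, so it shows the data flow only and is not the trained layer.

```python
import math


def _sigmoid(x):
    """Numerically stable Sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def conv1x1(feature, weights):
    """1 x 1 convolution: feature is H x W x C_in, weights is C_out x C_in."""
    return [[[sum(wk * v for wk, v in zip(w_row, px)) for w_row in weights]
             for px in row] for row in feature]


def attention_gate(enc, dec, w_enc, w_dec, w_psi):
    """Gate the encoder feature `enc` using the decoder feature `dec`.

    w_enc and w_dec project both inputs to a shared intermediate width;
    w_psi (a list of that width) reduces the sum to one score per pixel.
    Batch normalization is omitted in this sketch.
    """
    a = conv1x1(enc, w_enc)
    b = conv1x1(dec, w_dec)
    gated = []
    for enc_row, a_row, b_row in zip(enc, a, b):
        out_row = []
        for px, pa, pb in zip(enc_row, a_row, b_row):
            hidden = [max(0.0, u + v) for u, v in zip(pa, pb)]  # add, then ReLU
            score = sum(wk * hv for wk, hv in zip(w_psi, hidden))
            alpha = _sigmoid(score)  # attention coefficient in (0, 1)
            out_row.append([alpha * c for c in px])
        gated.append(out_row)
    return gated


# One pixel, two encoder channels, one decoder channel, hypothetical weights.
neutral = attention_gate([[[1.0, 2.0]]], [[[3.0]]], [[0.0, 0.0]], [[0.0]], [0.0])
```

A coefficient near 1 passes the skip-connection feature almost unchanged, whereas a coefficient near 0 suppresses it, which is how background responses are attenuated.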

Combination of Weighted Cross-Entropy Loss Function, Dice Loss, and Boundary Loss
When a segmentation model segments the infection from an organ, it is likely to ignore small foreground regions in the training process, resulting in low segmentation performance. In COVID-19 infection segmentation, this class imbalance problem can be addressed through the choice of loss function used for optimization. In this study, a combination of the weighted cross-entropy loss function and Dice loss was used as the region loss function to combine their usefulness for the imbalanced dataset problem. Moreover, the boundary loss was integrated to take into account the edge information between regions, which the region losses ignore.
Weighted cross-entropy loss is used to control category classification by calculating the probability of belonging to a specific class, as proposed by Warren Weaver [56]. The basic formula is
$$L_{CCE} = -\sum_{i}\sum_{j=1}^{m} y_{ij}\,\log\left(p_{ij}\right)$$
where i is the index of the samples, j is the index of the classes, y is the sample label, and p_ij ∈ (0, 1), with ∑_j p_ij = 1 for all i, is the prediction for a sample. Moreover, m is the number of classes (in binary segmentation, m = 2, which is a special case of category cross-entropy called Bernoulli cross-entropy loss [57]). Dice loss is inspired by the Dice score and is widely used in medical image segmentation to handle data imbalance problems. It addresses the imbalance between foreground and background and between uncomplicated and complex examples that affects a learning model's training process. It can be formulated as follows:
$$L_{Dice} = 1 - \frac{2\,|G \cap P|}{|G| + |P|}$$
where G is the ground truth and P is the prediction. The region loss combines the weighted category cross-entropy (CCE) loss and the Dice loss, where w denotes the respective weight. Boundary loss was proposed by Hoel Kervadec et al. [58,59], motivated by discrete optimization techniques for computing gradient flows of curve evolution. Boundary loss is complementary to region loss: it integrates over the boundary between regions instead of over the regions themselves, which helps to address unbalanced segmentation problems. It is computed from the distance Dist(∂G, ∂S_θ) between two boundaries in the spatial domain Ω: ∂G, the boundary of the ground-truth region G (with background Ω∖G), and ∂S_θ, the boundary of the segmentation region S_θ produced by the network.
The final boundary loss function is formulated as
$$L_{B}(\theta) = \int_{\Omega} \phi_G(p)\, s_\theta(p)\, dp$$
where φ_G is pre-computed directly from the ground truth region G and s_θ(p) is the SoftMax probability output of the network at point p; terms that are constant with respect to the network parameters θ are omitted. Finally, the region loss (weighted category cross-entropy and Dice loss) is combined with the boundary loss, where α is a parameter balancing the losses. We started with a low value of α > 0 and increased it gradually at the end of each epoch.
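One way the three loss terms might be combined is sketched below on flattened pixels, using only the standard library. Several choices here are assumptions made for illustration: the per-class weights and the averaging over pixels in the cross-entropy, the unweighted sum of cross-entropy and Dice loss as the region loss, the form region + α · boundary, the linear α schedule, and the convention that the pre-computed distance values `phi_g` are negative inside the ground-truth region and positive outside.

```python
import math


def weighted_cce(y_true, probs, class_weights, eps=1e-7):
    """Weighted category cross-entropy, averaged over pixels (assumed)."""
    total = 0.0
    for y_row, p_row in zip(y_true, probs):
        for y, p, w in zip(y_row, p_row, class_weights):
            total -= w * y * math.log(max(p, eps))
    return total / len(y_true)


def dice_loss(g, p, eps=1e-7):
    """Soft Dice loss for one class: 1 - 2|G n P| / (|G| + |P|)."""
    inter = sum(gi * pi for gi, pi in zip(g, p))
    return 1.0 - (2.0 * inter + eps) / (sum(g) + sum(p) + eps)


def boundary_loss(phi_g, s_theta):
    """Discrete boundary loss: mean of distance value times probability."""
    return sum(d * s for d, s in zip(phi_g, s_theta)) / len(phi_g)


def combined_loss(y_true, probs, class_weights, phi_g, alpha, fg=1):
    """Region loss plus alpha times boundary loss (assumed combination)."""
    g = [row[fg] for row in y_true]
    p = [row[fg] for row in probs]
    region = weighted_cce(y_true, probs, class_weights) + dice_loss(g, p)
    return region + alpha * boundary_loss(phi_g, p)


def alpha_at_epoch(epoch, start=0.01, step=0.01, maximum=0.99):
    """Hypothetical schedule: begin small and grow after every epoch."""
    return min(start + step * epoch, maximum)


# Two pixels (background, infection) with a perfect prediction.
labels = [[1, 0], [0, 1]]
perfect = [[1.0, 0.0], [0.0, 1.0]]
distances = [1.0, -1.0]
loss_value = combined_loss(labels, perfect, [1.0, 1.0], distances, alpha_at_epoch(0))
```

With this sign convention, probability mass placed inside the ground-truth region lowers the boundary term and mass placed far outside raises it, which is what lets the term penalize errors by their distance from the true contour.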

Performance Metrics
Four commonly used performance metrics in the field of medical image segmentation are the Dice coefficient score, the intersection over union (IoU) score, the sensitivity, and the specificity. We also computed the overall accuracy, precision, and F1-score to supplement the efficacy of the proposed model. The evaluation metrics, the accuracy, sensitivity, specificity, precision, and F1-score, were calculated based on the true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs).
Pixel accuracy is the easiest way to evaluate the segmentation model's performance:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision is a metric that measures the quality of the positive predictions:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Specificity is also called the true negative rate (TNR) and measures the proportion of negatives correctly determined by the model:
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
Sensitivity (recall) is used to evaluate the model performance by showing how many positive instances the model correctly identified:
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$
The F1-score is calculated by:
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}}$$
The most common measures to evaluate segmentation are the Dice coefficient score and the intersection over union (IOU) score. The Dice coefficient score is twice the overlapping area between the ground truth and the predicted segmentation divided by the total number of pixels in both images. It can be calculated as follows:
$$\mathrm{Dice} = \frac{2\,|G \cap P|}{|G| + |P|}$$
where G is the ground truth and P is the prediction. The IOU is the area of overlap between the predicted segmentation and the ground truth divided by the area of the union between them:
$$\mathrm{IOU} = \frac{|G \cap P|}{|G \cup P|}$$
Both the Dice score and the IOU score measure the overlap between the ground truth and the class predicted by the model. Both metrics are always positively correlated. The Dice score is closer to the average performance of the segmentation model, whereas the IOU score represents the worst-case performance of the segmentation model by penalizing bad classifications more.
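The metrics above can be computed directly from the confusion counts of a binary mask. The following standard-library sketch works on flattened lists of 0/1 labels; the function names and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def confusion_counts(truth, pred):
    """Count TP, TN, FP, and FN for flattened binary masks (1 = infection)."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def _ratio(num, den):
    """Return num / den, or 0.0 when the denominator is zero (assumed)."""
    return num / den if den else 0.0


def segmentation_metrics(truth, pred):
    """Return the evaluation metrics as a dictionary."""
    tp, tn, fp, fn = confusion_counts(truth, pred)
    precision = _ratio(tp, tp + fp)
    sensitivity = _ratio(tp, tp + fn)
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "specificity": _ratio(tn, tn + fp),
        "sensitivity": sensitivity,
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
        "dice": _ratio(2 * tp, 2 * tp + fp + fn),
        "iou": _ratio(tp, tp + fp + fn),
    }


scores = segmentation_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```

For a single binary mask, the two overlap scores are linked by Dice = 2 · IOU / (1 + IOU), so the Dice score is never lower than the IOU; this is why the IOU reads as the stricter of the two.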

Datasets
The datasets used to train and evaluate the SAA-UNet model for CT images were published by the European Institute for Biomedical Imaging Research (EIBIR) [24]. Table 1 shows the details of the limited CT datasets for identifying and quantifying the damage caused by COVID-19 in the lungs. The CT scan of one patient contains a set of slices taken simultaneously from different angles; each slice carries specific information about the lung and the damage of infection to it.
The MedSeg dataset [60] contains 100 slices of CT images from more than 40 patients with COVID-19, converted to the Neuroimaging Informatics Technology Initiative (NIfTI) format, and is openly accessible from the Italian Society of Radiology (SIRM) [61]. This dataset was segmented by a radiologist using three labels: ground-glass opacity (GGO), consolidation, and pleural effusion. Because of the rarity of pleural effusion, this label was deleted, and lung and background masks were added to the dataset [62]. The Radiopaedia 9P dataset [60] includes whole volumes of nine patients' CT scans with 829 slices collected from countries across the globe. It includes positive and negative slices, where the radiologist evaluated 373 out of the 829 slices as positive and segmented them. This dataset was converted, annotated, and normalized similarly to the MedSeg dataset [62].
The Zenodo 20P dataset [63] contains the CT scans with 3520 slices of 20 patients infected with COVID-19 collected from countries across the globe. Two radiologists annotated the left lung, right lung, and infection, then this was verified by an experienced radiologist.

Data Analysis and Preprocessing
The Hounsfield unit histogram is used to assess the contrast of each CT scan dataset before it is fed to the model. The CT scan contrast differs from one dataset to another. This section analyzes and shows the preprocessing of the MedSeg and Radiopaedia 9P datasets, then the Zenodo 20P dataset. Since the MedSeg and Radiopaedia 9P datasets were preprocessed similarly, they can be combined into one larger dataset.

Preprocessing of MedSeg and Radiopaedia 9P Datasets
At the start, the slices and related masks were rotated 90 degrees. The radiologists annotated the slice masks with four classes: ground-glass opacity (GGO), consolidation, lungs, and background. The Hounsfield unit (HU) histogram of the MedSeg dataset (100 slices) shows pixel intensities confined between −1606 and 597, as shown in Figure 6A. The HU histogram of the Radiopaedia 9P dataset (829 slices) shows pixel intensities confined between −1414 and 291, as shown in Figure 6C. These datasets have insufficient contrast, so the contrast enhancement method was used. After enhancing the contrast, the HU was normalized and confined between 0 and 1, as shown in Figure 6B for MedSeg and Figure 6D for Radiopaedia 9P. The last row, Figure 6E,F, shows the HU before and after preprocessing for the combination of the two datasets, which used the same preprocessing. Two examples of the CT images before and after using the contrast enhancement method are illustrated in Figure 7. The MedSeg dataset has a 512 × 512 resolution (this dataset was not resized because it is limited). The Radiopaedia 9P dataset and the combination of MedSeg and Radiopaedia 9P also have a 512 × 512 resolution and were resized by shrinking the slices to 128 × 128 using the OpenCV [54] library with inter-area interpolation to decrease the computational cost. Inter-area interpolation is computed from the ratio by which the image is shrunk, here 512/128 = 4 along each axis for these single-channel slices. This ratio determines the number of pixels whose average is assigned to one output pixel. In binary segmentation, the GGO and consolidation categories were combined into one infection category because segmenting the infection from the lung regions was our interest.
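The effect of shrinking with area averaging can be illustrated without OpenCV. The sketch below assumes a square single-channel image whose size is an integer multiple of the target size, in which case each output pixel is the mean of one ratio × ratio block. The function name is hypothetical, and the sketch only mirrors the behavior described above for integer ratios rather than the library's general implementation.

```python
def shrink_by_area(image, out_size):
    """Shrink a square 2D nested list by averaging non-overlapping blocks."""
    in_size = len(image)
    if in_size % out_size != 0:
        raise ValueError("input size must be a multiple of the output size")
    ratio = in_size // out_size  # e.g. 512 // 128 = 4
    block = ratio * ratio        # number of pixels averaged per output pixel
    return [
        [
            sum(image[i * ratio + di][j * ratio + dj]
                for di in range(ratio) for dj in range(ratio)) / block
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]


# A 4 x 4 toy image shrunk to 2 x 2 (ratio 2, four pixels per output pixel).
toy = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
small = shrink_by_area(toy, 2)
```

For the 512 × 512 slices resized to 128 × 128, the ratio is 4, so under this assumption each output pixel is the average of 16 input pixels.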

Preprocessing of Zenodo 20P Dataset
This dataset is in NIfTI format, so we first extracted the slices from the CT scans and rotated them 90 degrees. These data contain four classes for multi-class segmentation: infection, left lung, right lung, and background. In binary segmentation, our interest was in segmenting the infection from the slices. As shown in Figure 8A, the HU values were confined between −1023 and 4564, with intensities concentrated around −1000 (air) and 0 (water). We normalized the slices directly to lie between 0 and 1. The HU after normalization is shown in Figure 8B. Some slices had a 512 × 512 resolution and the rest 630 × 630, except for one patient scan at 401 × 630. We resized the slices to 128 × 128 using the OpenCV library with inter-area interpolation, as for the other datasets, to decrease the computational cost.
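The direct normalization to the range 0 to 1 can be sketched as a linear min-max rescaling. Whether the minimum and maximum were taken per slice, per scan, or over the whole dataset is not stated here, so the sketch below simply assumes they come from the array passed in; `normalize_hu` is a hypothetical helper name.

```python
import numpy as np


def normalize_hu(volume: np.ndarray) -> np.ndarray:
    """Linearly rescale intensities to [0, 1] using the array's own min and max."""
    volume = volume.astype(np.float32)
    lo, hi = float(volume.min()), float(volume.max())
    if hi == lo:  # constant image: avoid division by zero
        return np.zeros_like(volume)
    return (volume - lo) / (hi - lo)


# Toy array spanning the HU range reported for the Zenodo 20P dataset.
hu = np.array([[-1023.0, -1000.0], [0.0, 4564.0]])
norm = normalize_hu(hu)
print(norm.min(), norm.max())  # 0.0 1.0
```

With this mapping, air (about −1000 HU) lands close to 0 and water (0 HU) at roughly 0.18, so most of the lung content occupies the lower part of the normalized range.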

Experiments and Results
To demonstrate the impact of the proposed model, spatial attention and attention gate UNet (SAA-UNet), we trained it on the four above-mentioned datasets to segment the region of interest (ROI) of damage caused by COVID-19. We trained the model to diagnose whether an infection was present; if so, the model should segment the infected regions. In all experiments, we used all slices of the CT scans, including slices taken from different angles and slices of lung regions close to other organs.
This section includes the implementation details in Section 6.1, the binary class segmentation experiments in Section 6.2, and the multi-class segmentation experiments in Section 6.3.

Implementation Details
The SAA-UNet model was trained from scratch and implemented with Python 3.8.10, TensorFlow 2.9.2, and Keras 2.9.0 on Google Colab Pro+ with a GPU. First, we split each dataset into 90% for training and 10% for testing, following Ziyang Wang and Irina Voiculescu [34]. After that, we trained the proposed model on the training set with 10-fold cross-validation. K-fold cross-validation is necessary to evaluate the robustness and sensitivity of the proposed model. Hence, ten models were trained, each validated on a different one of the ten folds. Furthermore, these ten trained models were tested on the 10% hold-out testing set. Table 1 shows the number of slices in each training fold, validation fold, and test set. Following Hoel Kervadec et al. [58], we used no data augmentation. The Adam optimizer [64] was used with a learning rate of 10⁻⁴. The batch size was set to two, and each fold was trained for 150 epochs. The hyperparameters used are shown in Table 2.
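The splitting protocol can be sketched with the standard library alone. The sketch below assumes that the split is made at the slice level with a random shuffle; the helper `holdout_and_folds`, the seed, and the strided fold assignment are illustrative choices, not the exact procedure used in the experiments.

```python
import random


def holdout_and_folds(n_slices, test_frac=0.10, k=10, seed=0):
    """Return (test_idx, folds), where folds is a list of (train_idx, val_idx)."""
    idx = list(range(n_slices))
    random.Random(seed).shuffle(idx)
    n_test = round(n_slices * test_frac)
    test_idx, dev_idx = idx[:n_test], idx[n_test:]
    folds = []
    for i in range(k):
        val_idx = dev_idx[i::k]  # every k-th remaining index, starting at offset i
        val_set = set(val_idx)
        train_idx = [j for j in dev_idx if j not in val_set]
        folds.append((train_idx, val_idx))
    return test_idx, folds


# MedSeg has 100 slices: 10 are held out, and each fold trains on 81 and validates on 9.
test_idx, folds = holdout_and_folds(100)
train_idx, val_idx = folds[0]
print(len(test_idx), len(train_idx), len(val_idx))  # 10 81 9
```

For the 100 MedSeg slices, this gives 81 training slices per fold, which agrees with the MedSeg training count quoted in the Discussion.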

Binary Class Classification
This type of segmentation was performed to detect the infection on the CT image and to extract the infected region. First, ten-fold cross-validation was performed. Table 3 shows the experimental results of the SAA-UNet model in binary class segmentation, with all performance metrics presented as the mean and standard deviation over the ten models on the validation dataset. Table 4 shows the mean and standard deviation of the ten models obtained from the 10-fold cross-validation experiment on the testing set. The ten-fold cross-validation results in Table 3 show that the SAA-UNet model performed best on the Radiopaedia 9P and Zenodo 20P datasets, with mean Dice scores of 0.945 and 0.951, respectively. The Dice score for the infection region was highest for the Zenodo 20P data, indicating better segmentation of the infection area. The mean Dice score for the MedSeg dataset was lower (0.854) than for the above-mentioned datasets, and its Dice score for the infection class (0.752) was lower than that for the background class (0.983). When the combination of the MedSeg and Radiopaedia 9P datasets was used for training, better results were obtained for the overall mean Dice score and the individual Dice scores of both classes. The IOU score is the ratio of the area of overlap between the predicted segmentation and the ground truth to the area of their union. The same trend in the IOU score can be observed for all the datasets above. Table 4 summarizes the binary classification results on the testing dataset. The mean Dice and IOU scores were slightly lower for all the datasets than those reported on the validation datasets. Some sample predicted slices of the four datasets with COVID-19 infection pixels are illustrated in Figure 9. The confusion matrix of each dataset's test set for the SAA-UNet model appears in Figure 10.
These results show that the SAA-UNet predictions were close to the ground truth.
In CT imaging, contrast enhancement methods can be applied for infection segmentation to improve the visibility of the areas affected by the infection. This can highlight the areas of interest and make them more distinguishable from the background. However, it is also important to ensure that the contrast enhancement does not introduce artifacts or noise that may reduce the segmentation accuracy.
The contrast enhancement method was influential when training on unclear CT scan images. As shown in Table 5, this method affected the segmentation of the infection and background classes and improved the segmentation process. With contrast enhancement, the mean Dice score improved by 2.4% on the MedSeg dataset and by 1.3% on the Radiopaedia 9P dataset compared to training without it. Furthermore, the mean IOU of MedSeg improved by 2.2%, whereas that of Radiopaedia 9P was unchanged. For the poor-contrast MedSeg dataset, contrast enhancement significantly improved all the evaluation metrics. In addition, for the Radiopaedia 9P dataset, the sensitivity and the mean Dice score improved, especially the Dice score for infection, which helped in recognizing the foreground and distinguishing it from the background (an improvement of 1.3%). Figure 12 shows an example slice predicted by models of the same fold trained with and without the contrast enhancement method. The improvement after enhancing the contrast of the MedSeg dataset is easy to notice, whereas for Radiopaedia 9P, the mask was predicted almost as well as without contrast enhancement, with no adverse effect.
To assess the generalization of the trained model, we took each SAA-UNet trained on one dataset and tested it on the other datasets. The results are shown in Table 6. The performance metrics decreased when testing on other datasets compared to testing on the same dataset. First, the model trained on Radiopaedia 9P had better results when tested on the MedSeg dataset than on the Zenodo 20P dataset. The SAA-UNet model trained on the Zenodo 20P dataset had the best generalization when tested on the other datasets after applying the contrast enhancement method, which further supports the effectiveness of this method. The number of CT slices on which the model was trained and the clear contrast of the original images led to this generalization and gave promising results.
The model trained on the MedSeg dataset showed better results when tested on the Radiopaedia 9P dataset than on the Zenodo 20P dataset. In contrast, the models trained on the MedSeg + Radiopaedia 9P dataset and then tested on the Zenodo 20P dataset performed better than those trained on MedSeg alone. Overall, SAA-UNet generalized well across the different training dataset experiments.
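The Dice and IOU scores reported in this section can be computed per class from a pair of binary masks. The following is a minimal NumPy sketch on a toy example; the function name `dice_and_iou` and the smoothing term `eps` are illustrative assumptions rather than the exact implementation used in the experiments.

```python
import numpy as np


def dice_and_iou(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-7):
    """Dice and IOU for two binary masks of the same shape."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    # eps keeps both ratios defined when a class is absent from both masks.
    dice = (2.0 * intersection + eps) / (pred.sum() + truth.sum() + eps)
    iou = (intersection + eps) / (union + eps)
    return float(dice), float(iou)


# Toy masks: 1 marks infection pixels, 0 marks background.
truth = np.array([[0, 1, 1], [0, 1, 0]])
pred = np.array([[0, 1, 0], [0, 1, 1]])
dice, iou = dice_and_iou(pred, truth)
print(round(dice, 3), round(iou, 3))  # 0.667 0.5
```

For a single class, the two scores are tied by Dice = 2·IOU / (1 + IOU), so the Dice score is never lower than the IOU. A mean score over classes, as in the tables, would average the per-class values, for example over the infection mask and the inverted (background) mask.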

Multi-Class Classification
In the multi-class classification experiment, we explored the use of SAA-UNet on several classes, including the lung region and the infection or different types of infection. Five classes were considered: background, lungs, infection, consolidation (type of infection), and GGO (type of infection). GGO is a condition in which air is displaced by fluid in the lungs and is visible in CT images as an area of increased attenuation. If a region of normally compressible lung tissue is filled with liquid, it is called pulmonary consolidation. GGO is described as an increase in density with visible blood vessels, whereas consolidation is an increase in the parenchyma density that conceals the blood vessels. The classification results were obtained on the available classes of the three datasets. The MedSeg dataset had four classes, Radiopaedia 9P four classes, and the Zenodo 20P dataset only three classes. Table 7 presents the results of the proposed model in multi-class segmentation as the mean and standard deviation over the ten models on the validation dataset. The mean Dice score was highest for the Zenodo 20P dataset (0.94) and lowest for the MedSeg dataset (0.685). The Zenodo 20P dataset had only one infection class, which does not specify the type of infection. In the other two datasets, the type of infection was also identified (GGO or consolidation). The mean Dice score for the Radiopaedia 9P dataset was higher than that for the MedSeg dataset. When the two datasets were combined, the mean Dice score was better than that of the MedSeg dataset alone. The other performance metrics showed the same trend for the three datasets. Table 8 shows the mean and standard deviation of the ten models from the 10-fold cross-validation on the testing dataset. All the performance metrics decreased slightly on the testing dataset, but the trend remained the same. It can be seen from Tables 7 and 8 that the segmentation of the infection (GGO and Con classes) had a lower Dice score than the lungs and background classes.
This is because the labeled data for the infection classes were much fewer than for the lung and background classes. Segmenting smaller infection areas with diffuse boundaries is challenging, whereas segmenting larger infection areas with clear boundaries and good contrast is easier. It is evident from Table 7 that increasing the number of labeled slices improved the Dice score of the infection classes in the case of the combination of the MedSeg and Radiopaedia 9P datasets.
Some sample predicted slices of the three datasets with COVID-19 infection pixels are illustrated in Figure 13, which shows that the SAA-UNet predictions were close to the ground truth. The confusion matrix of each dataset for the SAA-UNet model appears in Figure 14. For the MedSeg dataset, both the GGO and Con classes were confused with the lung class (25% and 11%, respectively). The same trend was observed to a lesser extent for the Radiopaedia 9P dataset.
Figure 13. The predicted CT slices of the best fold for each dataset in multi-class segmentation. In the first three datasets, white is for GGO, grey is for consolidation, green is for lungs, and brown is for the background. For the last dataset, white is for infection, brown for the left lung, grey for the right lung, and lighter grey for the background.
Furthermore, we tried different split ratios of the datasets, namely 80%/20% and 70%/30% for the training and testing sets, respectively, for both binary and multi-class classification. Figure 15 shows the average Dice score and IOU for the different split ratios in all four datasets. The effect of the split ratio was small: neither the Dice score nor the IOU degraded significantly when the training ratio was decreased from 90% to 70%. The slight decrease in both scores suggests that more training data improves the segmentation results.
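The confusion percentages quoted for Figure 14 can be read as row-normalized entries of a pixel-wise confusion matrix, that is, the share of pixels of each true class assigned to each predicted class. The sketch below assumes this row normalization; the class indices (0 background, 1 lungs, 2 GGO, 3 consolidation) and the toy labels are hypothetical.

```python
import numpy as np


def confusion_matrix_percent(truth, pred, n_classes):
    """Row-normalized confusion matrix in percent (rows: true class, columns: predicted)."""
    truth = np.asarray(truth).ravel()
    pred = np.asarray(pred).ravel()
    # Encode each (true, predicted) pair as one integer and count the occurrences.
    counts = np.bincount(truth * n_classes + pred, minlength=n_classes ** 2)
    counts = counts.reshape(n_classes, n_classes).astype(np.float64)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows of classes absent from the ground truth stay at zero.
    return 100.0 * np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


# Toy pixel labels: one GGO pixel and one consolidation pixel are predicted as lungs.
truth = [0, 0, 1, 1, 2, 2, 2, 2, 3, 3]
pred = [0, 0, 1, 1, 2, 2, 2, 1, 3, 1]
cm = confusion_matrix_percent(truth, pred, n_classes=4)
print(cm[2])  # [ 0. 25. 75.  0.]
```

In this toy case, 25% of the true GGO pixels are assigned to the lung class, which is the kind of off-diagonal entry discussed for the MedSeg dataset.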

Discussion
Our proposed method, SAA-UNet, showed good performance on various datasets related to COVID-19 pneumonia. It also generalized well when trained on one dataset and tested on the other datasets. The weaker segmentation of the infection classes was also due to the variety of GGO and pulmonary consolidation morphologies. Moreover, even a small number of incorrectly classified pixels in the image segmentation significantly impacted the Dice coefficient score and IOU [65]. A comparative analysis of different methods used to segment COVID-19 infection in the lung from CT slices is shown in Table 9. For binary class segmentation on the MedSeg dataset, SAA-UNet had a better mean Dice score than the other reported results. On the Radiopaedia 9P dataset, our proposed method also outperformed other reported methods, with a Dice score of 0.94 for binary classification and 0.897 for multi-class classification. The Dice score of SAA-UNet was the best on the Zenodo 20P dataset, where our proposed model performed equally well in binary and multi-class classification (0.95 for binary and 0.94 for multi-class classification). This was because the Zenodo dataset (2851 slices) contains more slices available for training than the MedSeg (81 slices) and Radiopaedia 9P (671 slices) datasets in Table 1. Combining the MedSeg and Radiopaedia 9P datasets, containing more than 49 subjects, produced good binary and multi-class classification results. This showed the efficacy of our proposed model on various datasets in binary and multi-class classification. Compared to other published results, the SAA-UNet method performed better in region of interest (ROI) segmentation. It can quantify the severity of the infection and the patient's condition.
Therefore, it can be among the best methods for doctors to use in a follow-up study of the patient. As is evident from Table 5, enhancing the contrast of the CT scan improved the classifier's performance and gave better segmentation of the infection. Shiri et al. [66] reported a high overall Dice score (0.98 for lungs and 0.91 for lesions) on a dataset of volumetric CT exams when classifying the lungs and pneumonia lesions. Our paper provides more rigorous testing of the model in binary and multi-class classification. To further demonstrate the generalization ability of our model, we trained the model on one dataset and tested it on the rest of the datasets. The source code of the model is available upon request.

Conclusions
Diagnosing diseases using computer-aided diagnostic (CAD) methods improves the detection of diseases in real time. The proposed method, spatial attention and attention gate UNet (SAA-UNet), was based on spatial attention UNet (SA-UNet) and attention UNet (Att-UNet) to deal with the challenging structures of COVID-19 pneumonia. SAA-UNet can focus on the foreground to extract the lesion from computed tomography slices. Moreover, the training was optimized with a combination of weighted categorical cross-entropy loss, Dice loss, and boundary loss, which is useful for extracting the hazy edges of the infection and dealing with highly imbalanced datasets. The efficacy of the proposed model was established by testing on various datasets, including one with a small number of slices (MedSeg) and one with more patients (Radiopaedia 9P). The performance of the SAA-UNet model was also compared with other reported models. In future work, we will optimize this model further and extend it to a larger number of infections in MRI, CT scan, or X-ray images.